import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
import warnings
import patsy
import seaborn as sns
import sklearn
import sklearn.model_selection
import sklearn.ensemble
from sklearn import (linear_model, metrics, neural_network, pipeline, model_selection, tree)
warnings.filterwarnings("ignore")
%matplotlib inline
url = "https://raw.githubusercontent.com/ljy0401/ECON-323-Group-Project/main/amazon.csv"
amazon_raw = pd.read_csv(url)
# Remove unnecessary columns
amazon_clean = amazon_raw[['category', 'discounted_price', 'actual_price', 'discount_percentage',
                           'rating', 'rating_count', 'user_id', 'review_title']]
# Type conversions
for col in ["discounted_price", "actual_price", "rating_count"]:
    amazon_clean[col] = amazon_clean[col].str.replace("₹", "")
    amazon_clean[col] = amazon_clean[col].str.replace(",", "")
amazon_clean["discount_percentage"] = amazon_clean["discount_percentage"].str.replace("%", "")
amazon_clean['rating'] = amazon_clean["rating"].str.replace("|", "", regex=False)  # "|" would be treated as a regex alternation otherwise
num_cols = ["discounted_price", "actual_price", "discount_percentage", "rating", "rating_count"]
amazon_clean[num_cols] = amazon_clean[num_cols].apply(pd.to_numeric, errors='coerce')
amazon_clean['discount_percentage'] = amazon_clean['discount_percentage'] / 100
amazon_filtered = amazon_clean[["discounted_price", "actual_price", "rating", "rating_count"]]
amazon_filtered
| | discounted_price | actual_price | rating | rating_count |
|---|---|---|---|---|
| 0 | 399.0 | 1099.0 | 4.2 | 24269.0 |
| 1 | 199.0 | 349.0 | 4.0 | 43994.0 |
| 2 | 199.0 | 1899.0 | 3.9 | 7928.0 |
| 3 | 329.0 | 699.0 | 4.2 | 94363.0 |
| 4 | 154.0 | 399.0 | 4.2 | 16905.0 |
| ... | ... | ... | ... | ... |
| 1460 | 379.0 | 919.0 | 4.0 | 1090.0 |
| 1461 | 2280.0 | 3045.0 | 4.1 | 4118.0 |
| 1462 | 2219.0 | 3080.0 | 3.6 | 468.0 |
| 1463 | 1399.0 | 1890.0 | 4.0 | 8031.0 |
| 1464 | 2863.0 | 3690.0 | 4.3 | 6987.0 |
1465 rows × 4 columns
Intuitively speaking, the price, sales volume, and customer ratings of a good are inter-correlated. For instance, if a good is sold at a relatively low price and customers' experience with it is still above average, then its competitiveness would reasonably be assured and its sales volume would be predicted to be relatively high.
Therefore, in order to investigate how the variation between goods is driven by the combination of price, sales volume, and customer ratings, at the second-to-last stage of our Exploratory Data Analysis we decided to conduct Principal Component Analysis on our data. We assume that the data approximately lies on a hyperplane, and Euclidean distance is used. Note that we exclude the categorical variable "category" from the data.
Data preprocessing is important. PCA is usually conducted by applying Singular Value Decomposition (SVD) to the centered data, or by applying eigenvalue decomposition to the covariance matrix of the centered data, so centering is necessary. Meanwhile, the units of the variables may influence the performance of PCA, since variables on larger scales usually contribute more to the overall variation, so we also scale the centered data. In other words, standardization is required. Outliers would greatly affect the performance of PCA as well, so they are excluded before conducting PCA. Last but not least, standard PCA does not apply to data with missing values, so any rows with missing values must be dropped during preprocessing.
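As a small self-contained sketch of these preprocessing steps (on a hypothetical two-column frame, not our actual dataset), dropping incomplete rows and then centering and scaling by hand looks like:

```python
import numpy as np
import pandas as pd

# hypothetical toy frame standing in for the cleaned numeric columns
toy = pd.DataFrame({"price": [100.0, 200.0, np.nan, 400.0],
                    "rating": [4.1, 3.8, 3.9, np.nan]})

# standard PCA cannot handle NaNs, so drop incomplete rows first
toy_complete = toy.dropna()

# center, then scale by the population std: equivalent to StandardScaler
standardized = (toy_complete - toy_complete.mean()) / toy_complete.std(ddof=0)
```

After this, each column has mean 0 and standard deviation 1, which is exactly what `StandardScaler().fit_transform` produces in the cell below.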
# Standardize the data using StandardScaler
from sklearn.preprocessing import StandardScaler
amazon_filtered_normalized = StandardScaler().fit_transform(amazon_filtered)
amazon_filtered_normalized = pd.DataFrame(amazon_filtered_normalized,
                                          columns=amazon_filtered.columns)
# Remove outliers: keep values between the 0.15 and 0.85 quantiles of each column
for i in range(4):
    col = amazon_filtered_normalized.iloc[:, i]
    amazon_filtered_normalized = amazon_filtered_normalized[col.between(col.quantile(.15),
                                                                        col.quantile(.85))]
# Using side-by-side histograms to visualize the distribution of the preprocessed variables
figure, ax = plt.subplots(1,4, figsize = (16,4))
figure.suptitle("Distribution of each numerical variable after preprocessing", size=14)
figure.supxlabel("variable value after standardization and exclusion of outliers")
for i in range(4):
    amazon_filtered_normalized.iloc[:, i].plot(
        kind="hist", color=((20*i+50)/255, (20*i+50)/255, (20*i+70)/255),
        bins=23, legend=False, density=True, ax=ax[i])
    ax[i].set_title(amazon_filtered_normalized.columns[i])
We can see that, except for rating, the other three plots are all right-skewed. This observation will be useful later.
# Apply the PCA on preprocessed data
from sklearn.decomposition import PCA
pca_amazon = PCA(n_components=4)
PCA_fit_Amazon = pca_amazon.fit_transform(amazon_filtered_normalized)
Determining the number of principal components K to retain is the next stage of PCA, and it is usually guided by the proportion of variation explained by each PC. Many methods have been proposed to determine the optimal K, where some are more EDA-based and others take randomness and the predictive ability of PCA into consideration. From the EDA perspective, the scree test/scree plot is one of the most popular and widely used methods. If it is hard to draw a conclusion from the scree plot, we can alternatively keep the PCs whose eigenvalues are larger than the average eigenvalue, which is about 1 when PCA is applied to standardized data.
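As an illustration of the eigenvalue-greater-than-average rule on synthetic data (the data here is made up, not our Amazon data), a sketch might look like:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# synthetic data: two highly correlated columns plus two pure-noise columns
base = rng.normal(size=(500, 1))
X = np.hstack([base + 0.05 * rng.normal(size=(500, 1)),
               base + 0.05 * rng.normal(size=(500, 1)),
               rng.normal(size=(500, 2))])

pca = PCA().fit(StandardScaler().fit_transform(X))

# cumulative proportion of variance explained by the first k PCs
cum_var = np.cumsum(pca.explained_variance_ratio_)

# eigenvalue rule: keep PCs whose eigenvalue exceeds the average (about 1)
k_keep = int((pca.explained_variance_ > 1).sum())
```

Since the first PC absorbs the two correlated columns, its eigenvalue is close to 2 and it alone clears the threshold, while the two noise columns hover around eigenvalue 1.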
# The scree-plot of PCA (basically no need to check eigenvalues)
fig, ax = plt.subplots(figsize=(6,4))
PC_values = np.arange(pca_amazon.n_components_) + 1
plt.plot(PC_values, pca_amazon.explained_variance_ratio_, 'o-', linewidth=2, color='black')
plt.title('Scree-Plot for PCA on Amazon Sale Dataset')
plt.xlabel('Principal Component')
plt.ylabel('Proportion of Total Variance Explained')
for index in range(len(PC_values)):
    if index == 1:
        ax.text(2.05, 0.11,
                round(pca_amazon.explained_variance_ratio_[index], 3),
                size=12)
    elif index == 0:
        ax.text(1.1, 0.8,
                round(pca_amazon.explained_variance_ratio_[index], 3),
                size=12)
    else:
        ax.text(PC_values[index]-0.2,
                pca_amazon.explained_variance_ratio_[index]+0.03,
                round(pca_amazon.explained_variance_ratio_[index], 3),
                size=12)
plt.show()
Based on the scree plot, there is an apparent elbow at K=2, and the total variation explained by the first two principal components is 86.4% + 9.8% = 96.2%, which is fairly reasonable. On average, each eigenvalue should contribute 25% of the variance. We can see that only the first PC contributes more than that, but it would be risky to keep only one principal component. Therefore, we keep PC1 and PC2.
Also, returning to our motivation for conducting PCA, what we most want to know is how price, rating, and rating count contribute to these important principal components. For this, we can check the coefficient of each variable in the loading of each PC.
# Illustrate the loadings for each PC, i.e. the weight of each raw variable in each principal component
PCA_Amazon_loading = pca_amazon.components_
PC_list_in_string = ["PC"+str(i) for i in list(range(1, 5))]
PCA_Amazon_loading_df = pd.DataFrame.from_dict(dict(zip(PC_list_in_string, PCA_Amazon_loading)))
PCA_Amazon_loading_df['variable'] = amazon_filtered_normalized.columns.values
PCA_Amazon_loading_df = PCA_Amazon_loading_df.set_index('variable')
PCA_Amazon_loading_df
ax = sns.heatmap(PCA_Amazon_loading_df, annot=True)
plt.show()
As we can see in the loading plot, PC1 is mainly determined by rating, and PC2 is mainly determined by rating count. Since, based on the scree plot above, the first two PCs explain around 96% of the variation, we may say that rating and rating_count are the main contributors to the differences between these goods.
Meanwhile, another implication is that if we were to establish a regression model for our data, it is reasonable to believe that rating and rating_count would be more likely to be statistically significant variables.
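As a hedged sketch of how such a significance check could look, using scipy's `linregress` on made-up data (the variables and coefficients below are hypothetical, not estimated from our dataset):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# hypothetical: rating strongly drives some outcome variable, plus noise
rating = rng.normal(4.0, 0.3, size=300)
outcome = 50.0 * rating + rng.normal(0.0, 5.0, size=300)

# simple linear regression of outcome on rating
res = stats.linregress(rating, outcome)

# a small p-value would mark rating as statistically significant
significant = res.pvalue < 0.05
```

With a signal this strong, the slope estimate is positive and the p-value falls far below 0.05.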
Besides how the variation in the data can be decomposed, we are also interested in whether that variation leads to a specific structure in the data in the context of price and customer feedback, for which another unsupervised learning method may be explored: clustering. Clustering is used to determine how points form different clusters based on their similarity. The most popular methods are K-means clustering, Gaussian model-based clustering, and hierarchical clustering. Considering that we lack evidence for the assumption that the data is generated by a mixture of Gaussian distributions, and that the visualization of hierarchical clustering can be hard to read, K-means clustering seems to be an appropriate choice here.
Note that since our data has relatively low dimensionality, we apply clustering to the raw variables instead of the principal components, which also helps with interpretation. Also, since clustering is based on the Euclidean distance between data points, standardization is still necessary.
The result of clustering should be intuitively interpretable, but the randomness in selecting the initial centroids may lead to counter-intuitive results, so for each number of clusters we run the algorithm multiple times. Meanwhile, we also need to determine the best number of clusters. The usual criteria include the CH index and the within-cluster sum of squared distances; here we use the latter.
# import necessary package
from sklearn.cluster import KMeans
# For k from 2 to 8, run k-means clustering on data for 15 times
k_range = np.arange(2,9)
mean_wcss = np.zeros(7)
for k in range(2, 9):
    wcss_value_store = np.zeros(15)
    for i in range(15):
        model = KMeans(n_clusters=k,
                       random_state=i,
                       max_iter=200)
        amazon_kmeans = model.fit(amazon_filtered_normalized)
        wcss_value_store[i] = amazon_kmeans.inertia_
    mean_wcss[k-2] = wcss_value_store.mean()
# Create a data frame to store the k and corresponding within-cluster sum of squared distance
wcss_against_k = pd.DataFrame(mean_wcss, columns = ["Within Cluster Sum of Squared Distance"])
wcss_against_k["K Values"] = k_range
wcss_against_k = wcss_against_k.sort_values(["Within Cluster Sum of Squared Distance"], ascending=False)
# Using bar-plot
plt.figure(figsize=(6, 6))
amazon_bar_plots = sns.barplot(x="K Values", y="Within Cluster Sum of Squared Distance", data=wcss_against_k)
for bar in amazon_bar_plots.patches:
    amazon_bar_plots.annotate(format(bar.get_height(), '.2f'),
                              (bar.get_x() + bar.get_width() / 2,
                               bar.get_height()),
                              ha='center', va='center',
                              size=10, xytext=(0, 8), textcoords='offset points')
plt.title("Scree Plot for K-means Clustering")
plt.plot(k_range-2, mean_wcss, 'o-', linewidth=1.3, color = "black")
Based on the scree plot, the elbow point here is somewhat ambiguous. It may be safe to say that the elbow is at K equal to 3 (4 may also work, but the largest drop is at 3), and we will validate this result against our intuition.
As we emphasized before, the intuitive interpretation of the clustering output matters. However, it is visually difficult to examine the reasonableness of the result, since the dimensionality of the data is greater than 3. Considering that the discounted price may matter more to customers, we use only discounted_price, rating, and rating_count as axes to visualize the clustering result for further interpretation, and we will refit the clusters repeatedly until the result has a feasible interpretation.
# train a K-means model using the filtered data, and mutate the data
kmeans_optimal = KMeans(n_clusters=3, init="random", max_iter=200)
amazon_kmeans_optimal = kmeans_optimal.fit(amazon_filtered_normalized)
amazon_kmeans_for_plot = amazon_filtered_normalized.copy()
amazon_kmeans_for_plot["cluster"] = amazon_kmeans_optimal.labels_+1
amazon_kmeans_for_plot_short = amazon_kmeans_for_plot.drop(columns = "actual_price")
# use a 3D plot to visualize the result of clustering
import plotly.graph_objs as go
discounted_price = np.array(amazon_kmeans_for_plot_short['discounted_price'])
rating = np.array(amazon_kmeans_for_plot_short['rating'])
rating_count = np.array(amazon_kmeans_for_plot_short['rating_count'])
Trace = go.Scatter3d(x=discounted_price, y=rating, z=rating_count,
mode='markers',
marker=dict(color = amazon_kmeans_for_plot_short["cluster"], size= 10,
line=dict(color= 'black',width = 10)))
Scene = dict(xaxis = dict(title = 'Discounted Price'),
yaxis = dict(title = 'Rating'),
zaxis = dict(title = 'Rating Count'))
Layout = go.Layout(margin=dict(l=0,r=0), scene = Scene, height = 800, width = 800)
fig = go.Figure(data = Trace, layout = Layout)
fig.show()
We noticed that no matter how many times we run the K-means clustering, the results are similar, and it is rating that mainly determines the relative positions of the clusters. We are not surprised by this pattern, since the clustering result is consistent with the principal component analysis: from the PCA we know that it is mainly rating (and partially rating count) that determines the variation in our data. Therefore, it is reasonable to see that the interior structure of the data, which we define as clusters here, is dominated by rating, and there is no need to examine other clustering algorithms on our data.
# use side-by-side plots to illustrate the difference in different clusters
fig, axes = plt.subplots(ncols=4, figsize=(12, 5), sharey=False)
amazon_kmeans_for_plot.boxplot(by='cluster', return_type='axes', ax=axes)
Again, the results shown by the boxplots are consistent with the results of the PCA and the clustering. We can clearly observe that for actual price, discounted price, and rating count, the majority of points lie below the mean, which matches the skewness of the histograms above. Note that the outliers here are the outliers remaining after the raw outliers were excluded, which indicates that for these three variables a few high outliers balance against the rest of the data points.